"Dynamic timewarp barycenter averaging: Repairing polyline path information with user trajectory data" By Matthew Redmond. Live captioning by Norma Miller. @whitecoatcapxg >> OK, folks, the clock on the wall says 4:00 so we are going to go ahead and get started. This is Matt Redmond. >> Hello! So hi, my name is Matt Redmond, I'm a data scientist at Strava. My background personally is in computational geometry and linear algebra. Strava's goal is to build the most engaged community of athletes in the world. Our athletes upload GPS activity, we allow them to compete asynchronously with each other so if you are running by yourself, you could go run the same course as the pros and compare your times. So here's an example athletic activity. I did this yesterday. We went to bane bridge. It was delightful. Took a little ride around and you can see that we plotted this nice little polylypo. So Strava activities in general are between 0 and 200 miles with, you know, some variance in that. Some people ride crazy long, but for the most part that's a reasonable bound on how far our lines are. So as I alluded to earlier, the primary one of the primitives we have at Strava is the concept of a Strava segment. So a Strava segment is an athlete-created polyline to race over. The athlete with the fastest time is the queen or the king of the mountain. For historical reasons, it is always of the mount aye. It sounds better than Regen of the Levy or ruler of the flat place. Users, when creating a segment choose a start and end point from their own sensor data. When you upload your activity, your activity is matched against all in the database and we do this so you can track your progress against whatever you're interested in. Unfortunately sometimes segment quality is terrible. Here's an example of what our segment experience looks like. So this is a ride in northern California in the peninsula region. There's a famous climb here and you can see an all thetude chart and also an overlay. So Strava uses OpenStreetMaps with some custom graphic design stuff on top of T so that was good segment date Y here an example of fairly atrocious segment data. You can see, I think this is -- this the in center of San Francisco, this is the mount Sutro climb and whoever built this segment had some interesting GPS noise happening with their device, so the segment that the got recorded initially veers fairly off of the bend there and anybody who chooses to ride this segment in the future will see this and have a bad experience so it's in our interest to not have our users see this sort of degraded experience so one of our goals was to clean this up with a segment repair process. So here's the problem with segments, I think fundamentally, segment quality is poor because we have this underlying dependency on activity data and this has a few problems, so for one, the reason that it's based off of activity data is that we expected users to be able to go back and trim their segments and you wanted to store the original stream so you could edit segments after creation. But this is kind of a problem. It's a leaky abstraction. Really in my mind a segment should be just a stretch of the planet. It should be a polyline and it shouldn't have any time-varying data in it. Tying segments to activities also introduces some sampling biases. And what this ends up is oversampling where athletes are slow and undersampling where they're fast. 
And finally, one other problem with tying segment data to activities is that as sensors improve, we'd like our segment quality to improve, but if you're tied to an old, fixed activity, you have old sensor data and you lose out on all the improvements in filtering and all that sort of thing.

So one of my goals to fix segments was to build a repair tool, and I had a couple of goals in mind for it. Namely, I wanted it to be fast; ideally I'd run this over our entire dataset. It should be deterministic; I don't want my repair to give me random results. I'd want it to be idempotent. And, I think the hardest part, I'd like it to be verifiable. I'd like my segment repairs to be quantifiably better in some way than the original was, and if a repair is not better, I'd want the tool to refuse or reject the repair and keep the original path.

So the key idea for me was harnessing our segment-matching engine. When you go for a ride or a run, you upload your GPS data, and your path is matched against nearby candidate segments. We identify some possible candidates. Those are stored as tiles, and there's a state machine that plays your activity path over the set of tiles, keeps track of which tiles you've progressed to and which tiles you've jumped out of, and at the end of the day produces some matching statistics. It's fairly robust, it's been tested, it's been in production for quite a while, and in particular it's tuned for lenience. So we already have this large dataset of people who have been matched to segments, and instead of using one activity we can use the collective wisdom of many activities.

So to repair a target segment: identify a collection of activities that matched that target. From the training set, you either choose some really good example or, with some magic, create a new one (we'll talk about that later) as a repair candidate. And then finally, what we do with that repair candidate is test it head to head against the original over the testing set. If the statistics are stronger for the repair candidate, you accept it; otherwise you reject it. There's an asterisk here: if we are using matching data on what is assumed to be a known bad segment, it might stand to reason that the training data itself is somehow tainted or corrupted, and in theory I think this is actually true, but in practice we have enough aggregation power that you can get over that and the repairs turn out pretty well. I was hoping to put one of those little shrugging-guy images in there, but I don't know how to do that.
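That accept-or-reject loop is simple enough to sketch. Here is a minimal version in Python, where `build_candidate` and `match_score` are hypothetical stand-ins for the exemplar/synthesis step and for Strava's actual matching engine, and the 30-element training split mirrors the size mentioned later in the talk:

```python
from statistics import mean

def repair_segment(original, matched_activities, build_candidate, match_score):
    """Attempt a repair; keep the original unless the candidate tests better.

    `build_candidate` and `match_score` are hypothetical stand-ins for the
    exemplar/synthesis step and the segment-matching engine.
    """
    # Split matched activities into a training set and a held-out testing set.
    training, testing = matched_activities[:30], matched_activities[30:]

    # Either pick a representative exemplar or synthesize a consensus polyline.
    candidate = build_candidate(training)

    # Re-run the matcher for both polylines over the held-out activities.
    original_score = mean(match_score(original, a) for a in testing)
    candidate_score = mean(match_score(candidate, a) for a in testing)

    # Accept only if the candidate is measurably better; otherwise reject it,
    # which also keeps the repair process idempotent.
    return candidate if candidate_score > original_score else original
```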
Anyway, we'd like to have some way of choosing an exemplar candidate from that training set. What does that mean? Well, ideally we choose the candidate that looks the most representative of the training set. This problem is a little similar to classic calculus of variations problems, where we're trying to find the function that minimizes or maximizes some objective subject to some constraints. In this case, we're trying to find the element of the set that minimizes the sum of distances to the other elements. Mean and median are generally fairly well defined on a discrete metric space, where a metric space is a set equipped with a distance function.

On the real line, the common example: for the mean, you compute the sum and divide by the number of elements; for the median, you sort the set and choose the middle element. This extends fairly naturally to points in higher-dimensional spaces: you choose your standard Euclidean norm, and the mean becomes the centroid. The median is a little trickier. What you can do is define a medoid, which is the element that minimizes the total distance to the other points in the set.

As it turns out, the elements in our set aren't points, they're sequences of points, so we need a slightly more intelligent distance function than the standard Euclidean distance. There are a couple of candidates. The first one is the Hausdorff distance, which is intuitively the maximum distance from a point in one set to the nearest point in the other. That doesn't work great in this context. Another alternative notion of distance between sequences is the Fréchet distance: intuitively, if you went walking with your dog, what's the shortest leash you could get away with while each of you walks at your own pace? That incorporates some notion of sequence order, but there's still a lot of wasted information; we lose a lot of valuable information by only looking at the single worst distance.

Dynamic timewarp distance is a robust, well-studied distance function between sequences. It's not a metric, so it adds a little bit of complexity there, but it historically comes from the speech processing community. People speak at different rates, and you'd like to be able to identify one person saying the vowel O and another person saying the vowel O at slightly different frequencies as the same sound. So they came up with the dynamic timewarp distance by warping the signals into the same domain. Here's a nice picture. If you look at figure A, those might be two waveforms of people saying the same letter, and there is a clever way, which we'll talk about in a bit, to recursively align those sequences and choose the path of indices from Q to C that best represents the correct alignment. That warp path is shown in figure C. You can see that each point is tied to at least one point in the other sequence: this point here on the red curve, Q, looks like it's tied to a couple of points down here in C, and similarly, where this part of the pronunciation happens, you can see those points are tied roughly to the corresponding points down there.

So here's how you build the timewarp distance. We define it as the value of the last cell in a dynamic programming matrix, which we generate recursively: at each pair of indices, you can either take one step in the first time sequence, take one step in the second time sequence, or take a step in both of them at the same time. We build this up cell by cell, and the timewarp distance is the sum of the distances over all pairs of points on that warp path.

OK, so suppose now that we have some sensible notion of distance between sequences of points. Let's talk about segment repair again. We'd like to choose either a central element or an exemplar element from our training set, so in order to do that, we can equip that set with the dynamic timewarp distance and compute some medoids, whoo!
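A compact sketch of both pieces, the timewarp distance itself and medoid selection on top of it, might look like the following. This is illustrative rather than Strava's production code, and it assumes each sequence is a list of (x, y) tuples:

```python
import math

def dtw_distance(q, c):
    """Dynamic timewarp distance between two sequences of (x, y) points.

    Fill a dynamic programming matrix where each cell can be reached by
    stepping forward in q, stepping forward in c, or stepping in both at
    once; the distance is the value accumulated in the last cell, i.e. the
    sum of pointwise distances along the chosen warp path.
    """
    n, m = len(q), len(c)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(q[i - 1], c[j - 1])   # pointwise Euclidean distance
            D[i][j] = cost + min(D[i - 1][j],      # step in q only
                                 D[i][j - 1],      # step in c only
                                 D[i - 1][j - 1])  # step in both at once
    return D[n][m]

def dtw_medoid(sequences):
    """The element with the smallest summed DTW distance to all the others."""
    return min(sequences,
               key=lambda s: sum(dtw_distance(s, t) for t in sequences if t is not s))
```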
If you actually go about and do this, in theory it should work, and in practice it does. Here is some data from Strava's dataset. This is actually an interesting case; I think this is a running segment in San Francisco. The little purple arrowheads are people's actual GPS points, and the red line in the diagram is the original segment data. It's pretty bad. The various other colored lines are medoids computed from various training sets. So it works: we get a better segment than we had initially, but I'm still not convinced this is good. We still have the earlier flaws; we're still tying the segment to a single activity. So what if we had a better way to do this? Here's another example of a segment that gets fixed a little bit by medoid repair, but not sufficiently. Again, the red line is the original and the colored lines are sampled medoids.

So, I wish I could take credit for this because it's super clever, but I cannot. Rather than choosing the best existing candidate, we actually try to synthesize one. François Petitjean came up with this. You start with some guess at the consensus sequence, and then, in parallel, you take that guess and run the timewarp alignment from it to every element in your training set. In practice, we use about 30 elements in the training set, and this is pretty fast. Each point in the repair candidate is tied to at least one point from every training sequence; in practice it's usually tied to one or two points, which is pretty efficient. Then, for each point of the candidate, we clump up all the points aligned to it, compute the barycenter, and update that point to its barycenter. Here, again, a picture does a lot more than talking about it. You can start with some arbitrary sequence; the black line down here is a pretty terrible guess at what the consensus might be for these two pieces of training data. We align this whole sequence to the red thing and align it to the blue thing, and after this iteration we replace this point with the centroid of this point, this point, and this point. Next up, this point after alignment gets tied to this point and this point, and we do this iteratively until we see something that looks more like the right diagram. Already you can see that the black line is converging towards a more sensible guess at what the consensus sequence should be, and in practice you run this until you reach some tolerance and don't get much movement.

In practice, in order to do this quickly, you have to preprocess your data. Sometimes if you go for a run and you sit down at a cafe and leave your GPS running, you're going to keep generating data points while you sit on that bench. To combat that, we pass all of this data through a polyline simplification pass, which is parameterized; running that filter over an activity reduces it to 400 or so points. Additionally, we restrict our training set to data that comes from devices with barometric altimeters. The timewarp distance is quadratic, so it's quite slow, but you can build an approximation algorithm (another very clever person came up with this) that, instead of looking at the whole dynamic programming matrix, restricts attention to cells near where the warp path is expected to run, roughly near the main diagonal: it computes a coarse warp path at a reduced resolution, projects it onto the finer matrix, and instead of evaluating all of the cells it only evaluates the gray cells around that projection.
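The windowing idea can be sketched in a much simpler form than the coarse-to-fine scheme just described: a fixed Sakoe-Chiba-style band around the diagonal. This is not the same algorithm, but it shows the core trick of only evaluating a thin strip of the quadratic matrix; the `radius` parameter and function name are illustrative:

```python
import math

def banded_dtw_distance(q, c, radius=20):
    """DTW evaluated only in a band around the (rescaled) main diagonal.

    A Sakoe-Chiba-style window: a simpler cousin of the coarse-to-fine
    projection described above. Assumes the two sequences have broadly
    similar lengths (true after simplifying activities to ~400 points),
    so the bands of adjacent rows overlap and the path stays connected.
    """
    n, m = len(q), len(c)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        center = round(i * m / n)                    # diagonal, rescaled to m columns
        lo, hi = max(1, center - radius), min(m, center + radius)
        for j in range(lo, hi + 1):                  # skip cells outside the band
            cost = math.dist(q[i - 1], c[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```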
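And stepping back to the consensus-building step itself, here is a condensed sketch of the align-clump-barycenter loop described above (Petitjean's barycenter averaging). It reuses the same kind of DTW matrix, assumes (x, y) tuple points, and uses a fixed iteration count in place of a real convergence tolerance:

```python
import math

def dtw_warp_path(q, c):
    """Optimal warp path (list of index pairs) between sequences q and c."""
    n, m = len(q), len(c)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(q[i - 1], c[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Trace back from the last cell, always following the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)],
                   key=lambda ij: D[ij[0]][ij[1]])
    return list(reversed(path))

def dba_iteration(candidate, training):
    """One barycenter-averaging pass: align, clump, and move each point."""
    clumps = [[] for _ in candidate]
    for seq in training:
        for i, j in dtw_warp_path(candidate, seq):
            clumps[i].append(seq[j])                 # points of seq aligned to candidate[i]
    return [
        (sum(p[0] for p in pts) / len(pts),          # barycenter of the clump
         sum(p[1] for p in pts) / len(pts)) if pts else pt
        for pt, pts in zip(candidate, clumps)
    ]

def dba(initial_guess, training, iterations=10):
    """Iteratively refine an initial guess toward a consensus sequence."""
    candidate = list(initial_guess)
    for _ in range(iterations):                      # in practice: stop at a tolerance
        candidate = dba_iteration(candidate, training)
    return candidate
```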
In practice, this is implemented as a C-native library. As far as I know, it's the fastest implementation of this; I'm fairly sure it's the fastest. I started off with a Scala implementation, and that was a bad idea. It was quite slow, but switching to something closer to the metal was quite efficient.

So finally, post-processing on this dataset: once we have our repair, it's important to clean up the rough edges, so we run it through a polyline filter, then a median filter, and then a quadratic filter to smooth it out. Let's take a look at that. Here is some bad data. This is a climb in California, and at the start of the climb I don't know what the heck is going on: someone is riding like crazy, there's some diagonal cutting across the road, and it's basically just GPS noise. But there's enough data that we can build a sensible repair. Here's what it looks like after repair. Now, I want to stress that we don't have to be perfect in our segment repair, we just have to be better. So I would not claim that this blue line is perfect; I think there's some strange stuff in it. But we have validation statistics, so we can rerun our matching engine over the original data and over the repaired data and see that the match progress in this case got about 0.46 percent better, and the jump quality, which measures how much the activity jumps away from the tile set, got about 1.5% better. This is a pretty common result. It's not phenomenally better, but it's better enough that I would accept this repair. Right now this is an admin tool for Strava support that lets them manually repair segments, but we're in the process of running this repair tool over our entire dataset.

So finally, a little bit of accumulated wisdom; this is mostly engineering wisdom. It's important to handle altitude extremes separately; there's a lot of variance, especially from barometric data that comes to you as altitude diffs. Switching to a native library was very important. Preprocessing the data by simplifying the polylines was the biggest win. In organizational terms, building visualizations early was really essential. And finally, building something you're comfortable throwing away: I don't like throwing away my babies, but it was important to have a disposable prototype. Great, so I've stolen a bunch of code and diagrams from these papers, but they are delightful and quite easy to read, so they're great to read if you're interested. Thanks. Time for questions? If you have a question, please raise your hand and wait for the mic to get to you so that the video can pick up your question.

AUDIENCE MEMBER: Did you run into principal curve analysis when you were doing the research on this?

>> So the question was, did I run into principal curve analysis, and the answer is no. Tell me about that.

>> It's a paper by a guy at, I think, the University of Montreal or something like that, where it's taking essentially probe data and trying to figure out what curve best fits a set of points, essentially by minimizing distance metrics. It seems somewhat similar.

>> It's very similar, yeah. I'll have to look at that.

>> I'll send you the papers I have and the sample code that the researcher wrote.

>> Awesome, great, thank you, yeah.

AUDIENCE MEMBER: How does this relate or compare to the slide tool from Strava a couple of years ago?

>> Great. Good question.
So, the slide tool: I think someone is here who can answer detailed questions about that, but the slide tool uses our heat map data. It takes all of the activities that pass through a given point in space and builds sort of a 3D surface that has a minimizer passed across it, and in this way it allows you to run that sort of process over the entire world at once, because it's using all of the activity data. I was interested in more of a targeted, segment-level thing. Additionally, the slide tool is very well suited for large polylines, and I wanted to focus on something a little smaller. OK?

>> Any other questions?

AUDIENCE MEMBER: So the last step that you described was manual review and acceptance. Are there any plans to fully automate, or how would you approach that?

>> Yes, so I've built some code that computes something like a test statistic: how likely would it be that we would see these matching statistics, given that the repair is in fact not significantly better? So automating acceptance would, I imagine, involve looking at the distribution of those test statistics and then building some robust test to accept or reject automatically. Anyone else?

>> No? Oh, one more. OK.

AUDIENCE MEMBER: So I guess I'm just wondering, and this might be a quick question, but do you have any sense of why we would see things like that blue bulge, I think it was slide 27, where the quote-unquote improved trail has some strange happenings?

>> I have some conjectures. One of them is that it could be an artifact of the window width on the median filtering. That's just another parameter to tune, and I think you could be seeing some smoothing artifacts there, from people who maybe pulled into the parking lot in that particular case. It's hard to tell. But when I was debugging, that sort of thing was usually a window-width-parameter kind of issue.

>> Cool. Thanks.

>> Yeah.

>> Anyone else?

>> OK, thank you, Matt. Thanks. [applause]